Abstract

This paper studies issues concerning the parallel execution of regular Fortran DO loops on an
asynchronous shared-memory multiprocessor, where each iteration is the basic unit to be executed
by a single processing element. An iteration is a dependent predecessor of another iteration if
execution of the latter has to wait until execution of the former has completed. During the
execution of a DO loop, an iteration passes through four states: idle, pending, ready, and
finished. An iteration is idle if none of its dependent predecessors have completed; pending if
some, but not all, of its dependent predecessors have completed; ready if all its dependent
predecessors have completed but the iteration itself has not; and finished otherwise. In
addition, an iteration without any dependent predecessors is called an initial iteration, which
can only be in the ready or finished state. By describing an execution scheme, this paper studies
the characteristics of Fortran DO loops that are related to the efficiency of the execution.
Specifically, this paper investigates (1) the number of initial iterations, (2) the maximum
number of ready iterations at any instant during the execution, (3) the maximum number of pending
iterations at any instant during the execution, (4) a hash function to disperse different pending
iterations, and (5) the parallel execution time.


'  This  research  was  partially  supported  by  NSF  under  grant  number  CCR-89-6949. 
^Author's  electronic  addresses:  ouyangSispunky.cs.nyu.edu,  (212)  99&-3083. 


1      Introduction

As DO loops account for most of the parallelism in Fortran programs, how to execute a DO loop
efficiently in parallel environments is an important issue. Research on executing DO loops has
been done for various environments, including systolic arrays[6], vector processors[2,8], VLIW
processors[1,11], synchronous multiprocessors[3], and asynchronous multiprocessors[7,9].

There are many DO loops where the dependence distances between iterations can be determined at
compile time[6,9]. Let us call this type of DO loop a regular DO loop. This paper studies issues
concerning executing regular Fortran DO loops on asynchronous shared-memory multiprocessors such
as the Ultracomputer[4]. Although executing DO loops on environments such as vector processors
and VLIW processors has appealing results, the execution of DO loops on asynchronous
shared-memory multiprocessors is interesting because these might be the only available machines.
In addition, when the DO loops contain IF statements whose conditions are hard to predict
accurately at compile time, it is hard to execute DO loops efficiently on vector processors and
VLIW processors.

The execution scheme used in this paper is quite straightforward: each execution unit is an
iteration^1 and is greedily scheduled to be executed by a free processor. Although the concept of
the scheme is simple, several nontrivial issues have to be considered, namely, how much space is
needed to implement such a scheme, how an execution unit informs its dependent successors
efficiently, and how to determine whether parallel execution is superior to sequential execution.
Such issues of executing regular Fortran DO loops on asynchronous shared-memory multiprocessors
are not addressed in the existing literature and are the subjects of this paper.

The organization of this paper is as follows. Section 2 defines the model and terms used in our
discussion and presents an overview of our execution scheme. In Section 3, the data structures
used in our scheme are examined in detail. Specifically, the numbers of initial, ready, and
pending iterations are calculated, and a hash function to disperse different pending iterations
is presented. Section 4 shows the time needed to execute a DO loop under our execution scheme.
Finally, Section 5 gives a conclusion.

^1 If the execution time of a single iteration is too small compared with the synchronization
time, several iterations can be grouped into an execution unit[5]. This is a separate research
subject and we will not discuss it here.


2      Background

The loops considered in this paper are normalized DO loops of the form below:

    DO I_1 = 1, U_1
      DO I_2 = 1, U_2
        ...
          DO I_n = 1, U_n
            loop body
          ENDDO
        ...
      ENDDO
    ENDDO

Our execution scheme takes each iteration as the basic unit to be scheduled for execution. In
other words, there are in total $\prod_{i=1}^{n} U_i$ execution units in the above loop. For
convenience of presentation, an iteration will be represented by the values of its induction
variables. For example, iteration $[i_1, i_2, \ldots, i_n]$ means the iteration in which
induction variable $I_k$ is equal to $i_k$, for $1 \le k \le n$. Furthermore, the nested DO loop
will be modelled by an iteration space and several dependence vectors. The iteration space
corresponding to the above loop is the Cartesian space
$[1, U_1] \times [1, U_2] \times \cdots \times [1, U_n]$, and a dependence vector
$d_i = [d_{i1}, \ldots, d_{in}]$ for the above loop describes that iteration
$[s_1, \ldots, s_n]$ must be executed after iteration $[s_1 - d_{i1}, \ldots, s_n - d_{in}]$.
We call iteration $[s_1 - d_{i1}, \ldots, s_n - d_{in}]$ a dependent predecessor of iteration
$[s_1, \ldots, s_n]$. In our execution scheme, an iteration can be executed only after all its
dependent predecessors have completed. During the execution of a DO loop, an iteration passes
through four states: idle, pending, ready, and finished. An iteration is idle if none of its
dependent predecessors have completed; pending if some, but not all, of its dependent
predecessors have completed; ready if all its dependent predecessors have completed but the
iteration itself has not; and finished otherwise. In addition, an iteration without any dependent
predecessors is called an initial iteration, which can only be in the ready or finished state.

It is useful to represent an iteration space and its associated dependence vectors as a
dependence graph $G = (V, E)$, where each vertex in $V$ corresponds to an iteration in the
iteration space, and $\langle v_1, v_2 \rangle$ is in $E$ if the iteration corresponding to
$v_1$ is a dependent predecessor of the iteration corresponding to $v_2$. A longest path of $G$
is a path $p = v_0 v_1 \ldots v_l$ such that $v_i \in V$ for $0 \le i \le l$,
$\langle v_i, v_{i+1} \rangle \in E$ for $0 \le i \le l - 1$, and $l$ is the maximum over all
such paths. Note that the parallel execution time of a DO loop can be expressed as a function of
the length of the longest path.

Example 1 Consider the following program:

    DO I_1 = 1, 17
      DO I_2 = 1, 17
        a(I_1, I_2) = a(I_1 - 1, I_2 - 3) + a(I_1 - 3, I_2 - 1)
      ENDDO
    ENDDO

The associated iteration space is [1,17] x [1,17], and the associated dependence vectors are
[1,3] and [3,1]. In Figure 1, the iteration space is represented by a 17 x 17 grid. All the
iterations marked by □'s in Figure 1(a) are initial iterations, and all the iterations marked by
○'s in Figure 1(b) can be in the ready state simultaneously. Note that the number of ○'s in
Figure 1(b) is also the maximum number of ready iterations at any instant during the execution.
Finally, also shown in Figure 1(b) by a dashed line is one of the DO loop's longest paths, which
has length 8. □
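
The longest-path figure quoted in Example 1 can be cross-checked with a short script. The following is only a sketch: the names `U`, `DEPS`, `longest_from`, and `longest_path_length` are illustrative and do not come from the paper, and path length is counted in edges of the dependence graph.

```python
from functools import lru_cache

U = (17, 17)               # upper bounds of the two induction variables (Example 1)
DEPS = ((1, 3), (3, 1))    # dependence vectors of Example 1


def in_space(it):
    """True if iteration `it` lies inside [1, U_1] x [1, U_2]."""
    return all(1 <= x <= u for x, u in zip(it, U))


@lru_cache(maxsize=None)
def longest_from(it):
    """Number of edges on the longest dependence path that starts at `it`."""
    best = 0
    for d in DEPS:
        succ = tuple(x + dx for x, dx in zip(it, d))
        if in_space(succ):
            best = max(best, 1 + longest_from(succ))
    return best


def longest_path_length():
    """Longest path over all possible starting iterations."""
    return max(longest_from((i, j))
               for i in range(1, U[0] + 1)
               for j in range(1, U[1] + 1))


print(longest_path_length())
```

The recursion is safe here because every dependence vector moves strictly forward in the iteration space, so the dependence graph has no cycles.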

Now let us give an overview of our execution scheme. The execution scheme uses the following
three "pools" to store iterations:

•  INIT: a data structure used to store initial iterations.

•  READY: a FIFO queue used to store ready iterations.

•  PENDING: a data structure used to store pending iterations.

Note that an iteration can be stored by the values of its induction variables instead of the
whole loop body. The execution scheme is as follows. A free processor first tries to fetch an
iteration from INIT to execute. If INIT is empty, then an iteration is fetched from READY. If
READY is also empty, then the corresponding nested DO loop has been finished. Whenever a
processor finishes executing an iteration $c_1$, it installs each dependent successor $s_1$ as
follows:

Algorithm 1

    Try to find the entry for s_1 in PENDING;
    if s_1 is not in PENDING then
        if s_1 has no other dependent predecessors then
            install s_1 in READY;
        else
            install s_1 in PENDING;
        endif
    else
        At the s_1 entry in PENDING, mark c_1 finished;
        if s_1 has no other unfinished dependent predecessors then
            move s_1 from PENDING to READY;
        endif
    endif;

□
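
To make the interplay of the three pools concrete, the scheme can be simulated on one processor. This is a minimal sketch under stated assumptions, not the paper's implementation: `run`, `predecessors`, and `successors` are hypothetical names, PENDING is modelled as a dictionary from an iteration to the set of its finished predecessors, and the dependence vectors are assumed to be distinct.

```python
from collections import deque
from itertools import product


def predecessors(it, deps, U):
    """Dependent predecessors of `it` that lie inside the iteration space."""
    cands = (tuple(x - dx for x, dx in zip(it, d)) for d in deps)
    return [p for p in cands if all(1 <= x <= u for x, u in zip(p, U))]


def successors(it, deps, U):
    """Dependent successors of `it` that lie inside the iteration space."""
    cands = (tuple(x + dx for x, dx in zip(it, d)) for d in deps)
    return [s for s in cands if all(1 <= x <= u for x, u in zip(s, U))]


def run(U, deps):
    """Simulate the scheme sequentially and return the execution order."""
    space = product(*(range(1, u + 1) for u in U))
    init = deque(it for it in space if not predecessors(it, deps, U))  # INIT
    ready = deque()    # READY: FIFO queue
    pending = {}       # PENDING: iteration -> set of finished predecessors
    order = []
    while init or ready:
        c = init.popleft() if init else ready.popleft()
        order.append(c)                      # "execute" iteration c
        for s in successors(c, deps, U):     # Algorithm 1, once per successor
            needed = len(predecessors(s, deps, U))
            if s not in pending:
                if needed == 1:              # c was the only predecessor
                    ready.append(s)
                else:
                    pending[s] = {c}
            else:
                pending[s].add(c)            # mark c finished at the entry of s
                if len(pending[s]) == needed:
                    del pending[s]
                    ready.append(s)
    return order
```

For the loop of Example 1, `run((17, 17), ((1, 3), (3, 1)))` visits every iteration exactly once and never runs an iteration before its predecessors. A real multiprocessor version would additionally need atomic access to the three pools, which this sketch leaves out.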


The above execution scheme is quite naive but seems unavoidable in the execution of nested DO
loops under asynchronous shared-memory multiprocessor environments. However, some techniques can
be applied to this execution scheme to improve its performance. As mentioned in Section 1, the
iterations can be grouped into larger execution units to balance computation and synchronization
times. In addition, the data structure READY can be implemented as a priority queue where the
priority of a ready iteration is a function of its expected execution time, the number of its
dependent successors, the length of the longest path to an iteration without dependent
successors, and so forth. This paper will not discuss these techniques, but they can easily be
applied to our scheme when needed.

3      Properties of Initial, Ready, and Pending Iterations

In this section, characteristics of initial, ready, and pending iterations are studied. First of
all, our execution scheme needs to identify the initial iterations and put them in the data
structure INIT. A naive method to find all such iterations is to sweep through every iteration
and check whether it has dependent predecessors. This is described by the following algorithm:

Algorithm 2

    for each iteration [I_1, ..., I_n] in the iteration space [1, U_1] x ... x [1, U_n] do
        For each dependence vector d_i = [d_i1, d_i2, ..., d_in], for 1 <= i <= m,
        check if [I_1 - d_i1, ..., I_n - d_in] is within the range [1, U_1] x ... x [1, U_n];
        If any dependence vector makes the condition satisfied, then the iteration
        [I_1, ..., I_n] should not be put in INIT;
        otherwise, [I_1, ..., I_n] should be put in INIT;
    end /* for */

□
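
The naive sweep translates almost line for line into code. The sketch below is illustrative only; the function name `initial_iterations_naive` is an assumption and not part of the paper.

```python
from itertools import product


def initial_iterations_naive(U, deps):
    """Sweep the whole iteration space and keep iterations with no predecessor."""
    init = []
    for it in product(*(range(1, u + 1) for u in U)):
        # `it` depends on another iteration if, for some dependence vector d,
        # the point it - d still lies inside the iteration space.
        depends = any(
            all(1 <= x - dx <= u for x, dx, u in zip(it, d, U))
            for d in deps
        )
        if not depends:
            init.append(it)
    return init


print(len(initial_iterations_naive((17, 17), [(1, 3), (3, 1)])))
```

For the loop of Example 1 this reports 37 initial iterations, all of which lie along the low-index borders of the 17 x 17 grid.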

Figure 1: The initial iterations, ready iterations, and longest path of a DO loop with iteration
space [1,17] x [1,17] and dependence vectors [1,3] and [3,1].

In spite of the simplicity of the above algorithm, the time needed to execute it is
$\Theta(m \prod_{i=1}^{n} U_i)$. Much time can be saved if we can avoid scanning non-initial
iterations. Therefore, we need to know which iterations are initial iterations. From Example 1
in Section 2, it can be observed that initial iterations are located at the "borders" of the
iteration space. For example, consider an iteration space $[1,10] \times [1,10]$. For a
dependence vector $[2,0]$, the set of iterations
$\{[x,y] \mid 1 \le x \le 2, 1 \le y \le 10\}$ will not depend on any other iterations via the
dependence vector $[2,0]$. Hence this set consists of all the initial iterations if $[2,0]$ is
the only dependence vector. Suppose we have another dependence vector $[1,-3]$; then the set of
iterations
$\{[x,y] \mid 1 \le x \le 1, 1 \le y \le 10\} \cup \{[x,y] \mid 1 \le x \le 10, 8 \le y \le 10\}$
will not depend on any other iterations via the dependence vector $[1,-3]$. As an initial
iteration does not depend on any other iteration via any dependence vector, the set of initial
iterations for the dependence vectors $[2,0]$ and $[1,-3]$ is
$\{[x,y] \mid 1 \le x \le 2, 1 \le y \le 10\} \cap (\{[x,y] \mid 1 \le x \le 1, 1 \le y \le 10\} \cup \{[x,y] \mid 1 \le x \le 10, 8 \le y \le 10\})$.
The following theorem formally states which iterations are initial iterations:

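Theorem 1 Let $S = [1, U_1] \times [1, U_2] \times \cdots \times [1, U_n]$ be the iteration
space and $d_i = [d_{i1}, d_{i2}, \ldots, d_{in}]$, for $1 \le i \le m$, be $m$ dependence
vectors. Let

$$
I_{ij} =
\begin{cases}
\{[e_1, \ldots, e_n] \in S \mid 1 \le e_j \le d_{ij}\} & \text{if } d_{ij} > 0 \\
\emptyset & \text{if } d_{ij} = 0 \\
\{[e_1, \ldots, e_n] \in S \mid U_j + d_{ij} < e_j \le U_j\} & \text{if } d_{ij} < 0
\end{cases}
$$

Then an iteration $k$ is in $I = \bigcap_{i=1}^{m} \bigcup_{j=1}^{n} I_{ij}$ if and only if
iteration $k$ is an initial iteration in $S$.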


proof. If $k = [k_1, \ldots, k_n]$ is in $I$, we show that it is impossible that it depends on
any other iteration in the iteration space $S$. Suppose on the contrary that it depends on some
other iteration; then there must exist an iteration $k' = [k'_1, \ldots, k'_n]$ in $S$ such that
$k = k' + d_s$ for some fixed $s$. Since $k \in \bigcap_{i=1}^{m} \bigcup_{j=1}^{n} I_{ij}$,
$k$ must be in $I_{st}$ for some fixed $t$. Since $k = k' + d_s$, we have
$k_t = k'_t + d_{st}$. However,

•  if $d_{st} > 0$, then $k'_t = k_t - d_{st} \le d_{st} - d_{st} = 0$;

•  if $d_{st} = 0$, then $I_{st} = \emptyset$, so $k$ cannot be in $I_{st}$;

•  if $d_{st} < 0$, then $k'_t = k_t - d_{st} > U_t + d_{st} - d_{st} = U_t$.

In every case $k'$ cannot lie in $S$, so $k$ cannot depend on any other iteration in $S$,
contradicting our assumption. Hence we conclude that $k$ does not depend on any other iteration
in $S$.

Conversely, if $k \notin I$, then $k \notin \bigcup_{j=1}^{n} I_{sj}$ for some fixed $s$. In
other words, $k \notin I_{sj}$ for all $1 \le j \le n$. Let us consider the following cases:

•  if $d_{sj} > 0$, then $d_{sj} < k_j \le U_j$;

•  if $d_{sj} = 0$, then $1 \le k_j \le U_j$;

•  if $d_{sj} < 0$, then $1 \le k_j \le U_j + d_{sj}$.

Define $k' = [k_1 - d_{s1}, \ldots, k_n - d_{sn}]$. It is clear that $k'$ must be in the
iteration space $S$ and $k = k' + d_s$. That is, $k$ depends on $k'$. This completes our
proof. □
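
Theorem 1 can be checked numerically on the two-vector example given before it. The sketch below compares the closed-form set $I$ with the definition of an initial iteration; all function names are assumptions introduced for illustration, and dimensions are indexed from 0 as is usual in Python.

```python
from itertools import product


def in_I_ij(e, j, d, U):
    """Membership test for the set I_ij of Theorem 1 (j is a 0-based dimension)."""
    if d[j] > 0:
        return 1 <= e[j] <= d[j]
    if d[j] < 0:
        return U[j] + d[j] < e[j] <= U[j]
    return False    # d_ij = 0: I_ij is empty


def initial_by_theorem(U, deps):
    """The set I: intersection over vectors of the union over dimensions of I_ij."""
    space = product(*(range(1, u + 1) for u in U))
    return {e for e in space
            if all(any(in_I_ij(e, j, d, U) for j in range(len(U))) for d in deps)}


def initial_by_definition(U, deps):
    """Iterations with no dependent predecessor inside the iteration space."""
    space = product(*(range(1, u + 1) for u in U))
    return {e for e in space
            if not any(all(1 <= x - dx <= u for x, dx, u in zip(e, d, U))
                       for d in deps)}


U, deps = (10, 10), [(2, 0), (1, -3)]
print(initial_by_theorem(U, deps) == initial_by_definition(U, deps))
```

For the space [1,10] x [1,10] with vectors [2,0] and [1,-3], both functions return the same 13 iterations: the column $x = 1$ plus the three points with $x = 2$ and $8 \le y \le 10$.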


An algorithm generating all the initial iterations is described next. For clarity, let us
describe the algorithm by an example first. Consider an iteration space
$[1,10] \times [1,10] \times [1,10]$ and dependence vectors $d_1 = [0,2,3]$,
$d_2 = [1,-1,2]$ and $d_3 = [3,1,1]$. Let $d_i$ be denoted by $[d_{i1}, d_{i2}, d_{i3}]$ for
$1 \le i \le 3$. The algorithm recursively divides each dimension into regions until it reaches
the last dimension. Initially, dimension 1 is divided into three regions $[1,1]$, $[2,3]$, and
$[4,10]$ according to $d_{11}$, $d_{21}$ and $d_{31}$. In this way, we have

•  Each iteration in the subspace $[1,1] \times [1,10] \times [1,10]$ may depend on other
iterations only via $d_1$;

•  Each iteration in the subspace $[2,3] \times [1,10] \times [1,10]$ may depend on other
iterations only via $d_1$ or $d_2$;

•  Each iteration in the subspace $[4,10] \times [1,10] \times [1,10]$ may depend on other
iterations via $d_1$, $d_2$, or $d_3$.

With dimension 1 restricted to the region $[1,1]$, dimension 2 will be divided, using $d_{12}$
only, into $[1,2]$ and $[3,10]$. In this way, we have

•  Each iteration in the subspace $[1,1] \times [1,2] \times [1,10]$ cannot depend on any other
iteration;

•  Each iteration in the subspace $[1,1] \times [3,10] \times [1,10]$ may depend on other
iterations only via $d_1$.

Finally, with dimensions 1 and 2 restricted to the regions $[1,1]$ and $[1,2]$ respectively, the
initial iterations in the region $[1,1] \times [1,2] \times [1,10]$ are generated. Table 1
summarizes the generation procedure for this example and Algorithm 3 describes the general
procedure.

Algorithm 3 Let $[1, U_1] \times [1, U_2] \times \cdots \times [1, U_n]$ be the iteration space
and $d_i = [d_{i1}, d_{i2}, \ldots, d_{in}]$, for $1 \le i \le m$, be $m$ dependence vectors.
The algorithm generates the initial iteration indices by recursive calls to the procedure
iter(k, D). Let $x_1, \ldots, x_n, y_1, \ldots, y_n$ be global variables. In the main routine,
iter(1, {d_1, ..., d_m}) is called.

    procedure iter(k, D)
    /* k is the depth of the loop under consideration */
    /* D is a set containing dependence vectors */
    if k = n then
        x_n = max({1} U {U_n + d_in + 1 | d_i in D, d_in < 0});
        y_n = min({U_n} U {d_in | d_i in D, d_in > 0});
        generate iterations in the Cartesian space [x_1, y_1] x ... x [x_n, y_n];
    else
        T = {[f(d_ik), d_i] | d_i in D, d_ik != 0, |d_ik| < U_k} where
            f(d_ik) = d_ik          if d_ik > 0
            f(d_ik) = U_k + d_ik    if d_ik < 0
        D' = {d_i in D | -U_k < d_ik <= 0};
        finished = false;
        y_k = 0;
        while (not finished) do
            x_k = y_k + 1;
            if T = {} then
                y_k = U_k;
            else
                y_k = smallest(T);   /* smallest(T) is the smallest value among
                                        all f(d_ik)'s in the elements of T. */
            endif
            iter(k+1, D');
            if T = {} then
                finished = true;
            else
                for each [f(d_ik), d_i] in T such that f(d_ik) = y_k do
                    if d_ik > 0 then
                        insert d_i into D';
                    else /* d_ik < 0 */
                        delete d_i from D';
                    endif
                    remove [f(d_ik), d_i] from T;
                end
            endif
        end /* while */
    endif

□
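
A runnable reading of the recursive procedure is sketched below; it is an interpretation, not the paper's code. The names `initial_blocks`, `visit`, `expand`, and `brute_force` are hypothetical, indices are 0-based, and the dependence vectors are assumed distinct. One guard is an added assumption that the pseudocode does not spell out: if a vector still active at the last dimension has a zero component there, the block is treated as empty, since every iteration in it would then depend on a predecessor.

```python
from itertools import product


def initial_blocks(U, deps):
    """Yield blocks [(x_1, y_1), ..., (x_n, y_n)] that together cover INIT."""
    n = len(U)
    lo, hi = [0] * n, [0] * n            # play the role of the globals x_k, y_k

    def visit(k, D):
        if k == n - 1:                   # last dimension
            if any(d[k] == 0 for d in D):
                return                   # added guard: no initial iteration here
            lo[k] = max([1] + [U[k] + d[k] + 1 for d in D if d[k] < 0])
            hi[k] = min([U[k]] + [d[k] for d in D if d[k] > 0])
            if lo[k] <= hi[k]:           # skip "empty blocks"
                yield list(zip(lo, hi))
            return
        # T: breakpoints f(d_ik) at which the set of active vectors changes
        T = sorted((d[k] if d[k] > 0 else U[k] + d[k], d)
                   for d in D if d[k] != 0 and abs(d[k]) < U[k])
        active = [d for d in D if -U[k] < d[k] <= 0]       # the set D'
        hi[k] = 0
        while True:
            lo[k] = hi[k] + 1
            hi[k] = T[0][0] if T else U[k]
            yield from visit(k + 1, active)
            if not T:
                break
            while T and T[0][0] == hi[k]:
                _, d = T.pop(0)
                if d[k] > 0:
                    active.append(d)     # d becomes active past this breakpoint
                else:
                    active.remove(d)     # d can no longer be satisfied

    yield from visit(0, list(deps))


def expand(blocks):
    """Enumerate every iteration contained in a list of blocks."""
    return {it for b in blocks
            for it in product(*(range(x, y + 1) for x, y in b))}


def brute_force(U, deps):
    """Reference answer: iterations with no predecessor inside the space."""
    return {it for it in product(*(range(1, u + 1) for u in U))
            if not any(all(1 <= x - dx <= u for x, dx, u in zip(it, d, U))
                       for d in deps)}
```

For the three-dimensional example above, `list(initial_blocks((10, 10, 10), [(0, 2, 3), (1, -1, 2), (3, 1, 1)]))` yields nine blocks, starting with `[(1, 1), (1, 2), (1, 10)]`, and their union equals the brute-force set.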


The above algorithm is faster than Algorithm 2 because iteration indices which do not belong to
INIT are not generated. However, Algorithm 3 does waste time sweeping through "empty blocks"
when the last dimension $n$ is empty. In this case, Algorithm 3 can be revised so that the role
of dimension $n$ is taken by a nonempty dimension $k$ where $x_k$ is always less than or equal
to $y_k$. In addition, generating INIT need not be accomplished at compile time. Algorithm 3 can
be adapted to fit into a run-time self-scheduling scheme, which eliminates the space needed for
INIT.

    dimension 1                  dimension 2                  dimension 3
    set                 region   set                 region   set                  region
    {d_11,d_21,d_31}    [1,1]    {d_12}              [1,2]    {}                   [1,10]
                                                     [3,10]   {d_13}               [1,3]
                        [2,3]    {d_12,d_22}         [1,2]    {d_23}               [1,2]
                                                     [3,9]    {d_13,d_23}          [1,2]
                                                     [10,10]  {d_13}               [1,3]
                        [4,10]   {d_12,d_22,d_32}    [1,1]    {d_23}               [1,2]
                                                     [2,2]    {d_23,d_33}          [1,1]
                                                     [3,9]    {d_13,d_23,d_33}     [1,1]
                                                     [10,10]  {d_13,d_33}          [1,1]

Table 1: Generating initial iterations for the iteration space [1,10] x [1,10] x [1,10] and
dependence vectors d_1 = [0,2,3], d_2 = [1,-1,2] and d_3 = [3,1,1].

Next we consider the space requirement for the data structure READY. To determine the space
requirement, we have to know the maximum number of ready iterations at any instant during the
execution:

Theorem 2 Let $S = [1, U_1] \times [1, U_2] \times \cdots \times [1, U_n]$ be the iteration
space and $d_i = [d_{i1}, d_{i2}, \ldots, d_{in}]$, for $1 \le i \le m$, be $m$ dependence
vectors. Then the maximum number of ready iterations at any instant during the execution is less
than or equal to
$\min\{\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|) \mid 1 \le i \le m\}$.

proof. Let $I_{ij}$ be the same as in Theorem 1. Then for each dependence vector $d_i$,
$\bigcup_{j=1}^{n} I_{ij}$ is the set of iterations which do not depend on any other iterations
via $d_i$. In addition, note that when each iteration completes, at most one iteration can be
activated via dependence vector $d_i$. Therefore, the maximum number of ready iterations at any
instant during the execution is less than or equal to
$\min\{|\bigcup_{j=1}^{n} I_{ij}| \mid 1 \le i \le m\}$, where $|\bigcup_{j=1}^{n} I_{ij}|$
denotes the size of the set $\bigcup_{j=1}^{n} I_{ij}$.

Without loss of generality, we assume that $d_{ik} \ge 0$ for $1 \le i \le m$ and
$1 \le k \le n$ below in computing the value of $|\bigcup_{j=1}^{n} I_{ij}|$. According to the
definition of $I_{ij}$, we have

$$
\bigcup_{j=1}^{n} I_{ij}
= \bigcup_{j=1}^{n} \{[x_1, \ldots, x_n] \in S \mid 1 \le x_j \le d_{ij}\}
= \{[x_1, \ldots, x_n] \in S \mid \bigvee_{j=1}^{n} (1 \le x_j \le d_{ij})\}
= S - \{[x_1, \ldots, x_n] \in S \mid \bigwedge_{j=1}^{n} (d_{ij} < x_j \le U_j)\}.
$$

Therefore, we have
$|\bigcup_{j=1}^{n} I_{ij}| = \prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - d_{ij})$. This
completes our proof. □

From Theorem 2, it is enough to allocate
$c \cdot \min\{\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|) \mid 1 \le i \le m\}$
units of space to READY, where $c$ is a constant representing the space needed for each
iteration entry. Note that the value obtained in Theorem 2 also represents the maximum number of
processing elements that can be used simultaneously under our execution scheme.

Now let us determine the space required by PENDING. To do this, we have to know the maximum
possible number of pending iterations at any instant during the execution:

Theorem 3 Let $S = [1, U_1] \times [1, U_2] \times \cdots \times [1, U_n]$ be the iteration
space and $d_i = [d_{i1}, d_{i2}, \ldots, d_{in}]$, for $1 \le i \le m$, be $m$ dependence
vectors. Then the number of pending iterations at any instant during the execution cannot exceed
$\sum_{i=1}^{m} (\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|))$.


proof. When an iteration becomes pending, it must be "activated" by one of its dependent
predecessors. For a dependence vector $d_i$, the number of pending iterations that are activated
via $d_i$ cannot exceed $\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|)$, as can be
seen from Theorem 2. Therefore, the total number of pending iterations cannot exceed
$\sum_{i=1}^{m} (\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|))$. □

From the above theorem, it is enough to allocate
$c \sum_{i=1}^{m} (\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|))$ units of space to
PENDING, where $c$ is a constant representing the space needed for each iteration entry.
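
As a worked instance of the two space bounds, the sketch below evaluates them for the loop of Example 1. The function names are illustrative assumptions, the per-entry constant $c$ is left out, and the code assumes $|d_{ij}| \le U_j$ for every component.

```python
from math import prod


def border_size(U, d):
    """prod(U_j) - prod(U_j - |d_j|): iterations with no predecessor via d."""
    return prod(U) - prod(u - abs(x) for u, x in zip(U, d))


def ready_bound(U, deps):
    """Upper bound of Theorem 2 on the number of simultaneously ready iterations."""
    return min(border_size(U, d) for d in deps)


def pending_bound(U, deps):
    """Upper bound of Theorem 3 on the number of simultaneously pending iterations."""
    return sum(border_size(U, d) for d in deps)


U, deps = (17, 17), [(1, 3), (3, 1)]
print(ready_bound(U, deps), pending_bound(U, deps))
```

For each of the two vectors the border holds 289 - 16 * 14 = 65 iterations, so the bounds are 65 entries for READY and 130 entries for PENDING.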

In addition to deciding the space bound for PENDING, we also need an access scheme to accomplish
the action "try to find the entry for $s_1$ in PENDING" in Algorithm 1. The following algorithm
describes the access scheme:

Algorithm 4 Let $c = [c_1, \ldots, c_n] = [\sum_{i=1}^{m} d_{i1}, \ldots, \sum_{i=1}^{m} d_{in}]$.
Assume that $|c_i| < U_i$ for $1 \le i \le n$, which should account for most cases in practice.
We will represent PENDING as a hash table with entries indexed by $1, 2, \ldots, P$, where $P$
is equal to $\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |c_j|)$. Then an iteration
$[x_1, \ldots, x_n]$ will be assigned to a bucket of the entry indexed by $r$, where $r$ is
computed as below (assume that $c_j \ge 0$ for $1 \le j \le n$ for clarity):^2

    /* Find the initial iteration [y_1, ..., y_n] which is the
       ancestor of [x_1, ..., x_n] via dependence vector c */
    a = min{ floor((x_j - 1) / c_j) | 1 <= j <= n, c_j != 0 };
    p = min{ j | floor((x_j - 1) / c_j) = a for 1 <= j <= n and c_j != 0 };
    [y_1, ..., y_n] = [x_1 - a*c_1, ..., x_n - a*c_n];

    /* find s, the sum of sizes from block 1 to block p-1,
       where block i is the Cartesian space [c_1 + 1, U_1] x ... x
       [c_(i-1) + 1, U_(i-1)] x [1, c_i] x [1, U_(i+1)] x ... x [1, U_n] */
    s = sum_{i=1}^{p-1} ( prod_{j=1}^{i-1} (U_j - c_j) * c_i * prod_{j=i+1}^{n} U_j );

    /* find t, the address of [y_1, ..., y_n] within block p,
       where block p is the Cartesian space [c_1 + 1, U_1] x ... x
       [c_(p-1) + 1, U_(p-1)] x [1, c_p] x [1, U_(p+1)] x ... x [1, U_n] */
    let [z_1, ..., z_n] = [y_1 - c_1, ..., y_(p-1) - c_(p-1), y_p, ..., y_n];
    let [v_1, ..., v_n] = [U_1 - c_1, ..., U_(p-1) - c_(p-1), c_p, U_(p+1), ..., U_n];
    t = sum_{j=1}^{n} ( (z_j - 1) * prod_{k=j+1}^{n} v_k );   /* row-major order */

    r = s + t + 1;

□

^2 We define $\sum_{i=a}^{b} (\cdot) = 0$ and $\prod_{i=a}^{b} (\cdot) = 1$ when $b < a$.


Each of the initial iterations for the vector $c$ will be mapped to a unique number in the range
from 1 to $P$, which can be observed from the fact that the algorithm in essence just divides
the initial iterations for dependence vector $c$ into at most $n$ $n$-dimensional "blocks", and
then orders the iterations in each block in row-major order. With this access scheme, it is
expected that each entry of the hash table will usually store only one iteration. This is
because in most cases, all the dependent predecessors of iteration $I + d_1 + \cdots + d_m$
have iteration $I$ as a dependent ancestor. Hence when iteration $I$ is pending, iteration
$I + d_1 + \cdots + d_m$ cannot become pending, as that requires at least one of its dependent
predecessors to be finished, which in turn requires iteration $I$ to be finished. An entry of
the hash table may contain more than one iteration only when iterations near the iteration space
boundaries are being executed and some components of the dependence vectors are negative. For
example, for the iteration space $[1,10] \times [1,10] \times [1,10]$ and dependence vectors
$[0,1,-2]$, $[1,-2,1]$, and $[1,0,2]$, iteration $[2,2,2]$ and iteration $[4,1,3]$ may be in
the pending state simultaneously.

Finally, to support the above access scheme, we prove that the number of entries in PENDING is
less than or equal to the bound we got in Theorem 3, i.e.,
$\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |\sum_{i=1}^{m} d_{ij}|) \le \sum_{i=1}^{m} (\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|))$.
We need a lemma first:

Lemma 4 Let $[1, U_1] \times \cdots \times [1, U_n]$ be the iteration space, and let
$d_1 = [d_{11}, d_{12}, \ldots, d_{1n}]$ and $d_2 = [d_{21}, d_{22}, \ldots, d_{2n}]$ be two
dependence vectors. Also assume that $U_j \ge |d_{1j}| + |d_{2j}|$ for $1 \le j \le n$. Then we
have $\prod_{j=1}^{n} (U_j - |d_{1j}|) + \prod_{j=1}^{n} (U_j - |d_{2j}|) - \prod_{j=1}^{n} (U_j - \cdots$

proof. See appendix. □

With this lemma, we now show that the number of entries in PENDING is less than or equal to the
bound we got in Theorem 3:


Theorem 5 Let $S = [1, U_1] \times [1, U_2] \times \cdots \times [1, U_n]$ be the iteration
space and $d_i = [d_{i1}, d_{i2}, \ldots, d_{in}]$, for $1 \le i \le m$, be $m$ dependence
vectors. Assume that $U_j \ge \sum_{i=1}^{m} |d_{ij}|$ for $1 \le j \le n$. Then we have
$\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |\sum_{i=1}^{m} d_{ij}|) \le \sum_{i=1}^{m} (\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|))$.


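proof.

$$
\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} \Big(U_j - \Big|\sum_{i=1}^{m} d_{ij}\Big|\Big)
\le \cdots
\le \sum_{i=1}^{m} \Big(\prod_{j=1}^{n} U_j - \prod_{j=1}^{n} (U_j - |d_{ij}|)\Big)
$$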


4      Execution  Time  of  DO  Loops 

In this section, we consider when parallel execution is superior to sequential execution of
nested DO loops. Because of the synchronization cost, parallel execution of a nested DO loop is
not necessarily faster than sequential execution of the same loop. Therefore, the compiler
should estimate both the parallel and the sequential time to make the right choice.

Let $t$ denote the average execution time of an iteration, $s$ denote the synchronization time
required by Algorithm 1, and $l$ denote the number of iterations on the longest path in the
dependence graph. Then the sequential execution time is $t \prod_{i=1}^{n} U_i$, and the
parallel execution time is $(t + ms)l$, where $n$ is the depth of the nested DO loop, the
$U_i$'s are the upper bounds of the induction variables, and $m$ is the number of dependence
vectors. Therefore, parallel execution is preferred to sequential execution when
$(t + ms)l < t \prod_{i=1}^{n} U_i$, i.e., when

$$\frac{ms}{t} < \frac{\prod_{i=1}^{n} U_i}{l} - 1.$$
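
The decision rule is a one-line comparison, sketched below. The function name and the timing values are made up for illustration and are not measurements from the paper; $l = 9$ is used because nine iterations lie on the eight-edge longest path of Example 1.

```python
from math import prod


def parallel_is_preferred(t, s, m, l, U):
    """True when the parallel estimate (t + m*s)*l beats the sequential t*prod(U)."""
    return (t + m * s) * l < t * prod(U)


# Hypothetical costs in arbitrary time units for the loop of Example 1.
print(parallel_is_preferred(t=10, s=1, m=2, l=9, U=(17, 17)))     # cheap synchronization
print(parallel_is_preferred(t=1, s=200, m=2, l=9, U=(17, 17)))    # costly synchronization
```

With cheap synchronization the parallel estimate is 108 against 2890 time units, so the loop should run in parallel; when one synchronization costs 200 times an iteration, the estimate is 3609 against 289 and sequential execution wins.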


The remaining question now is how to find the value of $l$, that is, the length of the longest
path in the dependence graph. First of all, let us define the set of source iterations, $C$, for
the longest path. For each iteration $s_i = [s_{i1}, \ldots, s_{in}] \in C$, $s_{ij}$ can only
be either 1 or $U_j$. Specifically, $s_{ij}$ can be 1 only when some dependence vector has a
positive $j$-th component, and $s_{ij}$ can be $U_j$ only when some dependence vector has a
negative $j$-th component. If for some fixed dimension $j$ the $j$-th components of all
dependence vectors are zero, then dimension $j$ can be regarded as nonexistent in solving this
problem. The following theorem tells us how to find an upper bound on the length of the longest
path:


Theorem 6 Let $S = [1, U_1] \times \cdots \times [1, U_n]$ be the iteration space and
$d_i = [d_{i1}, \ldots, d_{in}]$, for $1 \le i \le m$, be $m$ dependence vectors. In addition,
let $C = \{s_1, s_2, \ldots, s_a\}$ be the set of source iterations. Then the length of the
longest path in the dependence graph cannot exceed $\max\{v_1, \ldots, v_a\}$, where $v_i$ is
obtained by solving the following integer programming problem:

$$
\begin{aligned}
\max \quad & v_i = x_1 + x_2 + \cdots + x_m \\
\text{subject to} \quad & 1 \le s_{ik} + \sum_{j=1}^{m} x_j d_{jk} \le U_k && \text{for } 1 \le k \le n \\
& x_j \ge 0 && \text{for } 1 \le j \le m
\end{aligned}
$$


proof.  In  a  dependence  graph,  any  iteration  reachable 
from  Sj  can  be  expressed  as  s,+Xidi  +  X2£/2  +  '  ■  -'rXmdm, 
where  Xj  is  greater  than  or  equal  to  0  for  1  <  j  <  rn. 
According  to  the  definition  of  the  longest  path,  all  the 
iterations  in  a  longest  path  must  be  within  the  iteration 
space  S.  Specifically,  the  "sink"  iteration  of  the  longest 
path  must  be  within  the  iteration  space  S  also.  That 
is. 


1  <sik  +  ^x,djk  <  Uk 


for  1  <  it  <  n 


Therefore,  the  length  of  the  longest  path  starting  from 
Si  in  the  dependence  graph  cannot  exceed  the  value  u, 
defined  by  the  above  integer  programming  problem. 

Now  we  show  that  considering  only  the  source  iter- 
ations in  C  is  sufficient.  Suppose  there  is  a  longest 
path  from  a  =  [ai,...,a„]  to  a  +  YlT=i  ^i^i'  where 
I  <  Qk  <  Uk  for  some  it.  Let  us  consider  the  following 
cases: 

•  If  1  <  Oit  <  at  +  X27=i  ^j'^i''  —  ^*'  ^^^^  there 
must  exist  some  dj  such  that  d^k  >  0.  Con- 
sequently, there  must  exist  some  s,  €  C  such 
that  s,k    =    1-     In  addition,   it  is  obvious  that 

•  If  1  <  ajt  -I-  YiJLi  ^jdjk  <  Ok  <  Uk,  then  there 
must  exist  some  dj  such  that  djk  <  0.  Con- 
sequently, there  must  exist  some  s;  €  C  such 
that  Sik   =  Uk-     In  addition,  it  is  obvious  that 

\<Uk  +  T.7=,xjdjk<Uk. 

•  If  1  <  at  =  a*  -I-  ^'j'-iXjdjk  <  Uk,  then  we 
have  1   =   1-1-  J2T=i  ^I'^ii'   <  ^*  ^"<^  I  <  Uk  + 

All  these  cases  imply  that  we  can  find  a  longest  path 
with  its  source  iteration  in  C  and  having  the  length  no 
less  than  ij  +  ■  •  ■  +  Xm-  This  completes  our  proof.  D 
Some comments follow. First, since the leftmost nonzero entry of a dependence
vector is always positive, the graph is acyclic. As a consequence, the maximum
value of the integer programming problem is always bounded. Second, efficiency
can be improved by using linear programming methods, such as the simplex
method, to find the value of $l$. For large $U_i$'s and small $d_{ij}$'s,
which should be the common cases, the error is expected to be small. Finally,
since most nested DO loops have depth less than or equal to 3 [10], it is
usually the case that no more
than  4  source  iterations  need  to  be  considered  ^. 
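The first comment can be illustrated with a short check. The Python sketch below is a minimal illustration under our own assumptions (the helper names `is_lex_positive` and `edges_respect_lex_order` are hypothetical): when every dependence vector has a positive leftmost nonzero entry, each edge of the dependence graph leads to a lexicographically larger iteration, so lexicographic order is a topological order and no cycle can exist.

```python
from itertools import product


def is_lex_positive(d):
    """True if the leftmost nonzero entry of d is positive."""
    for c in d:
        if c != 0:
            return c > 0
    return False  # the zero vector is not a valid dependence vector


def edges_respect_lex_order(deps, U):
    """Check that every edge p -> p + d inside the iteration space goes
    from a lexicographically smaller iteration to a larger one."""
    for p in product(*(range(1, u + 1) for u in U)):
        for d in deps:
            q = tuple(a + b for a, b in zip(p, d))
            inside = all(1 <= c <= u for c, u in zip(q, U))
            if inside and not p < q:  # tuple comparison is lexicographic
                return False
    return True


# Hypothetical dependence vectors for a doubly nested loop.
deps = [(1, -2), (0, 1)]
assert all(is_lex_positive(d) for d in deps)
assert edges_respect_lex_order(deps, (4, 4))
```

A vector such as (0, -1), whose leftmost nonzero entry is negative, fails the same check, since it would make an iteration depend on a lexicographically later one.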


5      Conclusion 

In an asynchronous multiprocessor, executing a nested DO loop involves
scheduling iterations to be executed as soon as possible. This kind of
scheduling is necessary even if grouping is applied. Although the concept is
easily understood, the implementation of such an execution scheme involves
several issues. This paper first sketches the execution scheme in section 2,
and then discusses the implementation issues in section 3, where the space
bound for the implementation is found and an addressing scheme is proposed.
Because of the synchronization costs, parallel execution of a nested DO loop
is not necessarily beneficial. The choice of whether or not to execute a DO
loop in parallel is studied in section 4, where finding the length of the
longest path in a dependence graph is transformed into a couple of integer
programming problems.

Some issues remain to be studied. First, it is desirable to find a lower
space bound for the data structure PENDING, as well as its associated
addressing scheme. Second, a more efficient and accurate algorithm to find
the length of the longest path for a dependence graph is needed. Third, we
would like to extend our execution scheme to more generic DO loops where an
outer loop can contain several single statements and other DO loops inside.
Finally, to improve the efficiency of the execution, it is also important to
consider hardware support for execution schemes.
Appendix

Proof of Lemma 4.

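Define $A_k = \{j \mid 1 \le j \le k,\ d_{1j} d_{2j} < 0, \text{ and } |d_{1j}| \ge |d_{2j}|\}$,
$B_k = \{j \mid 1 \le j \le k,\ d_{1j} d_{2j} < 0, \text{ and } |d_{2j}| > |d_{1j}|\}$,
and $C_k = \{j \mid 1 \le j \le k,\ d_{1j} d_{2j} \ge 0\}$. We prove by induction on $k$ that

$$
\begin{aligned}
& \prod_{j=1}^{k}(U_j - |d_{1j}|) + \prod_{j=1}^{k}(U_j - |d_{2j}|) - \prod_{j=1}^{k}(U_j - |d_{1j} + d_{2j}|) \\
\le{} & \prod_{j=1}^{k} U_j - \prod_{j \in A_k}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_k}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_k}(U_j - |d_{1j}| - |d_{2j}|) \\
& + \prod_{j \in A_k}(U_j - |d_{1j}|) \prod_{j \in B_k}(U_j - |d_{2j}|) \prod_{j \in C_k}(U_j - |d_{1j}| - |d_{2j}|) \qquad (1)
\end{aligned}
$$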

When $k = 1$, (1) is true since, by the definition of dependence vectors, $d_{11} \ge 0$ and $d_{21} \ge 0$.
Suppose (1) is true when $k = p - 1$; we prove that (1) is still true when $k = p$.
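Case 1: $d_{1p} d_{2p} \ge 0$:

$$
\begin{aligned}
& (U_p - |d_{1p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j}|) + (U_p - |d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{2j}|) - (U_p - |d_{1p} + d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \\
={} & U_p \Bigl( \prod_{j=1}^{p-1}(U_j - |d_{1j}|) + \prod_{j=1}^{p-1}(U_j - |d_{2j}|) - \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \Bigr) \\
& - |d_{1p}| \prod_{j=1}^{p-1}(U_j - |d_{1j}|) - |d_{2p}| \prod_{j=1}^{p-1}(U_j - |d_{2j}|) + (|d_{1p}| + |d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \\
\le{} & U_p \Bigl( \prod_{j=1}^{p-1} U_j - \prod_{j \in A_{p-1}}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
& \qquad + \prod_{j \in A_{p-1}}(U_j - |d_{1j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \Bigr) \\
& - |d_{1p}| \prod_{j=1}^{p-1}(U_j - |d_{1j}|) - |d_{2p}| \prod_{j=1}^{p-1}(U_j - |d_{2j}|) + (|d_{1p}| + |d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \\
& [\ \text{by induction hypothesis}\ ] \\
\le{} & \prod_{j=1}^{p} U_j - (U_p - |d_{1p}| - |d_{2p}|) \prod_{j \in A_{p-1}}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
& + U_p \prod_{j \in A_{p-1}}(U_j - |d_{1j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
& - |d_{1p}| \prod_{j \in A_{p-1}}(U_j - |d_{1j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
& - |d_{2p}| \prod_{j \in A_{p-1}}(U_j - |d_{1j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
={} & \prod_{j=1}^{p} U_j - \prod_{j \in A_p}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_p}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_p}(U_j - |d_{1j}| - |d_{2j}|) \\
& + \prod_{j \in A_p}(U_j - |d_{1j}|) \prod_{j \in B_p}(U_j - |d_{2j}|) \prod_{j \in C_p}(U_j - |d_{1j}| - |d_{2j}|)
\end{aligned}
$$

[ The second inequality holds because
$\prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) = \prod_{j \in A_{p-1}}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|)$,
$-\prod_{j \in A_{p-1}}(U_j - |d_{2j}|) \le -\prod_{j \in A_{p-1}}(U_j - |d_{1j}|)$,
$-\prod_{j \in B_{p-1}}(U_j - |d_{1j}|) \le -\prod_{j \in B_{p-1}}(U_j - |d_{2j}|)$,
and, for $j \in C_{p-1}$, both $U_j - |d_{1j}|$ and $U_j - |d_{2j}|$ are at least $U_j - |d_{1j}| - |d_{2j}|$. ]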

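Case 2: $d_{1p} d_{2p} < 0$ and $|d_{1p}| \ge |d_{2p}|$:

$$
\begin{aligned}
& (U_p - |d_{1p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j}|) + (U_p - |d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{2j}|) - (U_p - |d_{1p} + d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \\
={} & U_p \Bigl( \prod_{j=1}^{p-1}(U_j - |d_{1j}|) + \prod_{j=1}^{p-1}(U_j - |d_{2j}|) - \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \Bigr) \\
& - |d_{1p}| \prod_{j=1}^{p-1}(U_j - |d_{1j}|) - |d_{2p}| \prod_{j=1}^{p-1}(U_j - |d_{2j}|) + (|d_{1p}| - |d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \\
\le{} & U_p \Bigl( \prod_{j=1}^{p-1} U_j - \prod_{j \in A_{p-1}}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
& \qquad + \prod_{j \in A_{p-1}}(U_j - |d_{1j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \Bigr) \\
& - |d_{1p}| \prod_{j=1}^{p-1}(U_j - |d_{1j}|) - |d_{2p}| \prod_{j=1}^{p-1}(U_j - |d_{2j}|) + (|d_{1p}| - |d_{2p}|) \prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) \\
& [\ \text{by induction hypothesis}\ ] \\
\le{} & \prod_{j=1}^{p} U_j - (U_p - |d_{1p}| + |d_{2p}|) \prod_{j \in A_{p-1}}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
& + (U_p - |d_{1p}|) \prod_{j \in A_{p-1}}(U_j - |d_{1j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|) \\
={} & \prod_{j=1}^{p} U_j - \prod_{j \in A_p}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_p}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_p}(U_j - |d_{1j}| - |d_{2j}|) \\
& + \prod_{j \in A_p}(U_j - |d_{1j}|) \prod_{j \in B_p}(U_j - |d_{2j}|) \prod_{j \in C_p}(U_j - |d_{1j}| - |d_{2j}|)
\end{aligned}
$$

[ The second inequality holds because
$\prod_{j=1}^{p-1}(U_j - |d_{1j} + d_{2j}|) = \prod_{j \in A_{p-1}}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|)$,
$-\prod_{j=1}^{p-1}(U_j - |d_{1j}|) \le -\prod_{j \in A_{p-1}}(U_j - |d_{1j}|) \prod_{j \in B_{p-1}}(U_j - |d_{2j}|) \prod_{j \in C_{p-1}}(U_j - |d_{1j}| - |d_{2j}|)$,
and $-|d_{2p}| \prod_{j=1}^{p-1}(U_j - |d_{2j}|) \le 0$. ]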

Case 3: $d_{1p} d_{2p} < 0$ and $|d_{2p}| > |d_{1p}|$: This case is similar to Case 2 and hence is omitted.

This completes our proof of inequality (1). Since
$\prod_{j \in A_k}(U_j - |d_{1j}| + |d_{2j}|) \prod_{j \in B_k}(U_j - |d_{2j}| + |d_{1j}|) \prod_{j \in C_k}(U_j - |d_{1j}| - |d_{2j}|) \ge \prod_{j \in A_k}(U_j - |d_{1j}|) \prod_{j \in B_k}(U_j - |d_{2j}|) \prod_{j \in C_k}(U_j - |d_{1j}| - |d_{2j}|)$
for any $k$, the lemma is proved by setting $k$ to $n$. □
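Inequality (1) and the bound that follows from it can be spot-checked numerically. The Python sketch below is only a sanity check under our own assumptions, not a substitute for the proof: the function names are hypothetical, and the random cases are restricted to $|d_{1j}| + |d_{2j}| \le U_j$ (so every factor is nonnegative) and to first entries $d_{11}, d_{21} \ge 0$.

```python
import random
from math import prod


def lhs(d1, d2, U):
    """Left-hand side of inequality (1) with k = n."""
    return (prod(u - abs(a) for a, u in zip(d1, U))
            + prod(u - abs(b) for b, u in zip(d2, U))
            - prod(u - abs(a + b) for a, b, u in zip(d1, d2, U)))


def rhs(d1, d2, U):
    """Right-hand side of inequality (1) with k = n."""
    shifted = 1  # product of the factors carrying both |d1j| and |d2j|
    tight = 1    # product of the factors in the last term of (1)
    for a, b, u in zip(d1, d2, U):
        if a * b < 0 and abs(a) >= abs(b):   # j in A_n
            shifted *= u - abs(a) + abs(b)
            tight *= u - abs(a)
        elif a * b < 0:                      # j in B_n
            shifted *= u - abs(b) + abs(a)
            tight *= u - abs(b)
        else:                                # j in C_n
            shifted *= u - abs(a) - abs(b)
            tight *= u - abs(a) - abs(b)
    return prod(U) - shifted + tight


def random_case(rng, n=3, umax=8):
    """Random bounds and vectors with |d1j| + |d2j| <= U_j (assumption)."""
    U = [rng.randint(2, umax) for _ in range(n)]
    d1 = [rng.randint(0, U[0] // 2)] + [rng.randint(-(u // 2), u // 2) for u in U[1:]]
    d2 = [rng.randint(0, U[0] // 2)] + [rng.randint(-(u // 2), u // 2) for u in U[1:]]
    return U, d1, d2


rng = random.Random(0)
for _ in range(1000):
    U, d1, d2 = random_case(rng)
    # Inequality (1), followed by the final step of the proof.
    assert lhs(d1, d2, U) <= rhs(d1, d2, U) <= prod(U)
```

The second comparison in the assertion corresponds to the last step of the proof, where the product carrying both $|d_{1j}|$ and $|d_{2j}|$ dominates the product in the last term of (1).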